Abstract

This paper addresses semantic image segmentation by incorporating rich
information into a Markov Random Field (MRF), including high-order relations
and a mixture of label contexts. Unlike previous works that optimized MRFs
using iterative algorithms, we solve the MRF by proposing a Convolutional
Neural Network (CNN), namely the Deep Parsing Network (DPN), which enables
deterministic end-to-end computation in a single forward pass. Specifically,
DPN extends a contemporary CNN architecture to model unary terms, and
additional layers are carefully devised to approximate the mean field (MF)
algorithm for pairwise terms. It has several appealing properties. First,
unlike recent works that combined CNN and MRF, where many iterations of MF
were required for each training image during back-propagation, DPN achieves
high performance by approximating one iteration of MF. Second, DPN represents
various types of pairwise terms, making many existing works its special cases.
Third, DPN makes MF easier to parallelize and speed up on a Graphics
Processing Unit (GPU). DPN is thoroughly evaluated on the PASCAL VOC 2012
dataset, where a single DPN model yields a new state-of-the-art segmentation
accuracy of 77.5%.

1. Introduction 

Markov Random Fields (MRFs) or Conditional Random Fields (CRFs) have achieved
great success in semantic image segmentation, which is one of the most
challenging problems in computer vision. Existing works such as
[31, 29, 9, 34, 11, 2, 8, 25, 22] can generally be categorized into two groups
based on their definitions of the unary and pairwise terms of the MRF.

In the first group, researchers improved labeling accuracy by exploring rich
information to define the pairwise functions, including long-range
dependencies [16, 17], high-order potentials [37, 36], and semantic label
contexts [21, 26, 38]. For example, Krähenbühl et al. [16] attained accurate
segmentation boundaries by inferring on a fully-connected graph. Vineet et al.
[37] extended [16] by defining both high-order and long-range terms between
pixels. Global or local semantic contexts between labels were also
investigated by [38]. Although these works accomplished promising results,
they modeled the unary terms with SVM or Adaboost, whose learning capacity
becomes a bottleneck. Moreover, the learning and inference of complex pairwise
terms are often expensive.

In the second group, researchers learned a strong unary classifier by
leveraging the recent advances of deep learning, such as the Convolutional
Neural Network (CNN). With deep models, these works
[23, 24, 25, 22, 3, 28, 39, 30, 19] demonstrated encouraging results using a
simple definition of the pairwise function or even ignoring it. For instance,
Long et al. [22] transformed the fully-connected layers of a CNN into
convolutional layers, making accurate per-pixel classification possible using
contemporary CNN architectures that were pre-trained on ImageNet [6]. Chen et
al. [3] improved [22] by feeding the outputs of the CNN into an MRF with
simple pairwise potentials, but treated the CNN and the MRF as separate
components. A recent advance was obtained by [30], which jointly trained CNN
and MRF by passing the error of MRF inference backward into the CNN, but
iterative inference of the MRF such as the mean field algorithm (MF) [27] is
required for each training image during back-propagation (BP). Zheng et al.
[39] further showed that the procedure of MF inference can be represented as a
Recurrent Neural Network (RNN), but its computational cost is similar. We
found that directly combining CNN and MRF as above is inefficient, because a
CNN typically has millions of parameters while an MRF infers thousands of
latent variables; even worse, incorporating complex pairwise terms into the
MRF becomes impractical, limiting the performance of the entire system.

This work proposes a novel Deep Parsing Network (DPN), which is able to
jointly train a CNN and complex pairwise terms. DPN has several appealing
properties. (1) DPN solves the MRF with a single feed-forward pass, reducing
computational cost while maintaining high performance. Specifically, DPN
models unary terms by extending the VGG-16 network (VGG16) [32] pre-trained on
ImageNet, while additional layers are carefully designed to model complex
pairwise terms. Learning of these terms is transformed into deterministic
end-to-end computation by BP, instead of embedding MF into BP as [30, 19] did.
Although MF can be represented by an RNN [39], it needs to compute the forward
pass recurrently to achieve good performance and is thus time-consuming, e.g.
each forward pass contains hundreds of thousands of weights. DPN approximates
MF by using only one iteration. This is made possible by jointly learning
strong unary terms and rich pairwise information. (2) Pairwise terms determine
the graphical structure. In previous works, if the former is changed, so is
the latter as well as its inference procedure. But with DPN, modifying the
complexity of pairwise terms, e.g. the range of pixels and contexts, is as
simple as modifying the receptive fields of convolutions, without varying BP.
DPN is able to represent multiple types of pairwise terms, making many
previous works [3, 39, 30] its special cases. (3) DPN approximates MF with
convolutional and pooling operations, which can be sped up by low-rank
approximation [14] and easily parallelized [4] on a Graphics Processing Unit
(GPU).

Our contributions are summarized below. (1) A novel DPN is proposed to jointly
train VGG16 and rich pairwise information, i.e. a mixture of label contexts
and high-order relations. Compared to existing deep models, DPN can
approximate MF with only one iteration, reducing computational cost while
still maintaining high performance. (2) We show that DPN represents multiple
types of MRFs, making many previous works such as RNN [39] and DeepLab [3] its
special cases. (3) Extensive experiments investigate which components of DPN
are crucial to achieving high performance. A single DPN model achieves a new
state-of-the-art accuracy of 77.5% on the PASCAL VOC 2012 [7] test set. (4) We
analyze the time complexity of DPN on GPU.

2. Our Approach 

DPN learns the MRF by extending VGG16 to model unary terms, while additional
layers are carefully designed for pairwise terms.

Overview. An MRF [10] is an undirected graph where each node represents a
pixel in an image $I$, and each edge represents a relation between pixels.
Each node is associated with a binary latent variable, $y_i^u \in \{0,1\}$,
indicating whether pixel $i$ has label $u$. We have
$\forall u \in L = \{1, 2, ..., l\}$, representing a set of $l$ labels. The
energy function of the MRF is written as

$E(\mathbf{y}) = \sum_{\forall i \in \mathcal{V}} \Phi(y_i^u) + \sum_{\forall i,j \in \mathcal{E}} \Psi(y_i^u, y_j^v)$,    (1)

where $\mathbf{y}$, $\mathcal{V}$, and $\mathcal{E}$ denote a set of latent
variables, nodes, and edges, respectively. $\Phi(y_i^u)$ is the unary term,
measuring the cost of assigning label $u$ to the $i$-th pixel. For instance,
if pixel $i$ belongs to the first category rather than the second one, we
should have $\Phi(y_i^1) < \Phi(y_i^2)$. Moreover, $\Psi(y_i^u, y_j^v)$ is the
pairwise term that measures the penalty of assigning labels $u, v$ to pixels
$i, j$ respectively.

Intuitively, the unary terms represent per-pixel classifications, while the
pairwise terms represent a set of smoothness constraints. The unary term in
Eqn.(1) is typically defined as

$\Phi(y_i^u) = -\ln p(y_i^u = 1 | I)$,    (2)

where $p(y_i^u = 1 | I)$ indicates the probability of the presence of label
$u$ at pixel $i$, modeled by VGG16. To simplify discussions, we abbreviate it
as $p_i^u$. The smoothness term can be formulated as

$\Psi(y_i^u, y_j^v) = \mu(u,v) d(i,j)$,    (3)

where the first term learns the penalty of global co-occurrence between any
pair of labels, e.g. the output value of $\mu(u,v)$ is large if $u$ and $v$
should not coexist, while the second term calculates the distances between
pixels, e.g.
$d(i,j) = \omega_1 \|I_i - I_j\|^2 + \omega_2 \|[x_i\ y_i] - [x_j\ y_j]\|^2$.
Here, $I_i$ indicates a feature vector such as RGB values extracted from the
$i$-th pixel, $x, y$ denote coordinates of pixels' positions, and
$\omega_1, \omega_2$ are constant weights. Eqn.(3) implies that if two pixels
are close and look similar, they are encouraged to have labels that are
compatible. It has been adopted by most of the recent deep models [3, 39, 30]
for semantic image segmentation.
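As a concrete illustration of Eqns.(2) and (3), the following minimal sketch evaluates the unary cost of one prediction and the pairwise penalty for one pair of pixels. The function names, the default weights `w1`, `w2` (standing in for $\omega_1, \omega_2$), and the toy inputs are illustrative assumptions, not values taken from this paper.

```python
import math

def unary_cost(p):
    """Eqn.(2): cost of a label whose predicted probability is p."""
    return -math.log(p)

def pixel_distance(feat_i, feat_j, pos_i, pos_j, w1=1.0, w2=1.0):
    """d(i,j) of Eqn.(3): weighted squared distances in appearance and position.

    w1 and w2 are placeholder weights (assumed values).
    """
    appearance = sum((a - b) ** 2 for a, b in zip(feat_i, feat_j))
    position = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    return w1 * appearance + w2 * position

def pairwise_penalty(mu_uv, feat_i, feat_j, pos_i, pos_j, w1=1.0, w2=1.0):
    """Eqn.(3): label co-occurrence cost mu(u,v) scaled by d(i,j)."""
    return mu_uv * pixel_distance(feat_i, feat_j, pos_i, pos_j, w1, w2)

# Two toy pixels with RGB features in [0, 1] and integer (x, y) coordinates.
d = pixel_distance((0.2, 0.4, 0.6), (0.2, 0.4, 0.1), (3, 4), (0, 0))
```

A confident prediction has a lower unary cost than an uncertain one, e.g. `unary_cost(0.9) < unary_cost(0.1)`.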

However, Eqn.(3) has two main drawbacks. First, its 
first term captures the co-occurrence frequency of two 
labels in the training data, but neglects the spatial context 
between objects. For example, ‘person’ may appear beside 
‘table’, but not at its bottom. This spatial context is a 
mixture of patterns, as different object configurations may 
appear in different images. Second, it defines only the 
pairwise relations between pixels, missing their high-order 
interactions. 

To resolve these issues, we define the smoothness term by leveraging rich
information between pixels, which is one of the advantages of DPN over
existing deep models. We have

$\Psi(y_i^u, y_j^v) = \sum_{k=1}^{K} \lambda_k \mu_k(i,u,j,v) \sum_{\forall z \in \mathcal{N}_j} d(j,z) p_z^v$.    (4)

The first term in Eqn.(4) learns a mixture of local label contexts, penalizing
label assignment in a local region, where $K$ is the number of components in
the mixture and $\lambda_k$ is an indicator, determining which component is
activated. We define $\lambda_k \in \{0,1\}$ and
$\sum_{k=1}^{K} \lambda_k = 1$. An intuitive illustration is given in Fig.1
(b), where the dots in red and blue represent a center pixel $i$ and its
neighboring pixels $j$, i.e. $j \in \mathcal{N}_i$, and $(i,u)$ indicates
assigning label $u$ to pixel $i$. Here, $\mu_k(i,u,j,v)$ outputs the labeling
cost between $(i,u)$ and $(j,v)$ with respect to their relative positions. For
instance, if $u, v$ represent 'person' and 'table', the learned penalties of
positions $j$ that are at the bottom of center $i$ should be large. The second
term basically models a triple penalty, which involves pixels $i$, $j$, and
$j$'s neighbors, implying that if $(i,u)$ and $(j,v)$ are compatible, then
$(i,u)$ should also be compatible with $j$'s nearby pixels $(z,v)$,
$\forall z \in \mathcal{N}_j$, as shown in Fig.1 (a).

Figure 1: (a) Illustration of the pairwise terms in DPN. (b) explains the
label contexts. (c) and (d) show that the mean field update of DPN corresponds
to convolutions.

The parameters in Eqn.(1) (i.e. the weights of VGG16 and the costs of label
contexts) are learned by minimizing the distance between the ground-truth
label map and $\mathbf{y}$, which needs to be inferred subject to the
smoothness constraints.

Inference Overview. Inference of Eqn.(1) can be obtained by the mean field
(MF) algorithm [27], which estimates the joint distribution of the MRF,
$P(\mathbf{y}) = \frac{1}{Z} \exp\{-E(\mathbf{y})\}$, by using a
fully-factorized proposal distribution,
$Q(\mathbf{y}) = \prod_{\forall i \in \mathcal{V}} \prod_{\forall u \in L} q_i^u$,
where each $q_i^u$ is a variable we need to estimate, indicating the predicted
probability of assigning label $u$ to pixel $i$. To simplify the discussion,
we denote $\Phi(y_i^u)$ and $\Psi(y_i^u, y_j^v)$ as $\Phi_i^u$ and
$\Psi_{ij}^{uv}$, respectively. $Q(\mathbf{y})$ is typically optimized by
minimizing a free energy function [15] of the MRF,

$F(Q) = \sum_{\forall i \in \mathcal{V}} \sum_{\forall u \in L} q_i^u \Phi_i^u + \sum_{\forall i,j \in \mathcal{E}} \sum_{\forall u \in L} \sum_{\forall v \in L} q_i^u q_j^v \Psi_{ij}^{uv} + \sum_{\forall i \in \mathcal{V}} \sum_{\forall u \in L} q_i^u \ln q_i^u$.    (5)

Specifically, the first term in Eqn.(5) characterizes the cost of each pixel's
predictions, while the second term characterizes the consistencies of
predictions between pixels. The last term is the entropy, measuring the
confidences of predictions. To estimate $q_i^u$, we differentiate Eqn.(5) with
respect to it and equate the resulting expression to zero. We then have a
closed-form expression,

$q_i^u \propto \exp\{-(\Phi_i^u + \sum_{\forall j \in \mathcal{N}_i} \sum_{\forall v \in L} q_j^v \Psi_{ij}^{uv})\}$,    (6)

such that the prediction for each pixel is attained independently by repeating
Eqn.(6). This implies that whether pixel $i$ has label $u$ is proportional to
the estimated probabilities of all its neighboring pixels, weighted by their
corresponding smoothness penalties. Substituting Eqn.(4) into (6), we have

$q_i^u \propto \exp\{-\Phi_i^u - \sum_{k=1}^{K} \lambda_k \sum_{\forall v \in L} \sum_{\forall j \in \mathcal{N}_i} \mu_k(i,u,j,v) \sum_{\forall z \in \mathcal{N}_j} d(j,z) q_j^v q_z^v\}$,    (7)

where each $q_i^u$ is initialized by the corresponding $p_i^u$ in Eqn.(2),
which is the unary prediction of VGG16. Eqn.(7) satisfies the smoothness
constraints.

In the following, DPN approximates one iteration of Eqn.(7) by decomposing it
into two steps. Let $Q^v$ be a predicted label map of the $v$-th category. In
the first step, as shown in Fig.1 (c), we calculate the triple penalty term in
(7) by applying an $m \times m$ filter on each position $j$, where each
element of this filter equals $d(j,z) q_j^v$, resulting in $Q^{v'}$.
Apparently, this step smooths the prediction of pixel $j$ with respect to the
distances between it and its neighborhood. In the second step, as illustrated
in (d), the labeling contexts can be obtained by convolving $Q^{v'}$ with an
$n \times n$ filter, each element of which equals $\mu_k(i,u,j,v)$, penalizing
the triple relations as shown in (a).
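The generic update of Eqn.(6), of which Eqn.(7) is a special case, can be sketched in a few lines. This is a plain-Python toy under stated assumptions: the data layout, the callable `psi`, and the Potts-style penalty used in the example are all hypothetical choices for illustration and are not the pairwise terms learned by DPN.

```python
import math

def mean_field_step(q, phi, neighbors, psi):
    """One parallel update of Eqn.(6) for every pixel.

    q[i][u]          current estimate for label u at pixel i
    phi[i][u]        unary cost of label u at pixel i
    neighbors[i]     indices of the neighbors of pixel i
    psi(i, u, j, v)  pairwise cost (assumed callable interface)
    """
    new_q = []
    for i, q_i in enumerate(q):
        scores = []
        for u in range(len(q_i)):
            # Expected pairwise penalty under the neighbors' current estimates.
            penalty = sum(q[j][v] * psi(i, u, j, v)
                          for j in neighbors[i]
                          for v in range(len(q[j])))
            scores.append(math.exp(-(phi[i][u] + penalty)))
        z = sum(scores)  # normalize over labels
        new_q.append([s / z for s in scores])
    return new_q

# Toy example: three pixels in a chain, two labels.
p = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
phi = [[-math.log(x) for x in row] for row in p]   # Eqn.(2)
neighbors = [[1], [0, 2], [1]]

def potts(i, u, j, v):
    """Assumed toy penalty: cost 1 when neighboring labels differ."""
    return 0.0 if u == v else 1.0

q1 = mean_field_step(p, phi, neighbors, potts)   # q initialized by p
```

In this toy case the undecided middle pixel is pulled toward label 0, the label preferred by both of its neighbors, after a single update.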

3. Deep Parsing Network 

This section describes the implementation of Eqn.(7) in a Deep Parsing Network
(DPN). DPN extends VGG16 as the unary term, and additional layers are designed
to approximate one iteration of MF inference as the pairwise term. The
hyper-parameters of VGG16 and DPN are compared in Table 1.

VGG16. As listed in Table 1 (a), the first row represents the name of the
layer and 'x-y' in the second row represents the size of the receptive field
and the stride of convolution, respectively. For instance, '3-1' in the
convolutional layer implies that the receptive field of each filter is 3×3 and
it is applied on every single pixel of an input feature map, while '2-2' in
the max-pooling layer indicates each feature map is pooled over every other
pixel within a 2×2 local region. The last three rows show the number of the
output feature maps, activation functions, and the size of output feature
maps, respectively. As summarized in Table 1 (a), VGG16 contains thirteen
convolutional layers, five max-pooling layers, and three fully-connected
layers. These layers can be partitioned into twelve groups, each of which
covers one or more homogeneous layers. For example, the first group comprises
two convolutional layers with 3×3 receptive field and 64 output feature maps,
each of which is 224×224.

(a) VGG16: 224×224×3 input image; 1×1000 output labels

(b) DPN: 512×512×3 input image; 512×512×21 output label maps
group      | b1   | b2  | b3   | b4  | b5   | b6  | b7   | b8   | b9   | b10  | b11  | b12 | b13 | b14 | b15
activation | relu | idn | relu | idn | relu | idn | relu | relu | relu | relu | sigm | lin | lin | idn | soft
size       | 512  | 256 | 256  | 128 | 128  | 64  | 64   | 64   | 64   | 64   | 512  | 512 | 512 | 512 | 512

Table 1: The comparisons between the network architectures of VGG16 and DPN,
as shown in (a) and (b) respectively. Each table contains five rows,
representing the 'name of layer', 'receptive field of filter'-'stride',
'number of output feature maps', 'activation function' and 'size of output
feature maps', respectively. Furthermore, 'conv', 'lconv', 'max', 'bmin',
'fc', and 'sum' represent the convolution, local convolution, max pooling,
block min pooling, full connection, and summation, respectively. Moreover,
'relu', 'idn', 'soft', 'sigm', and 'lin' represent the activation functions,
including rectified linear unit [18], identity, softmax, sigmoid, and linear,
respectively.

3.1. Modeling Unary Terms 

To make full use of VGG16, which is pre-trained on ImageNet, we adopt all its
parameters to initialize the filters of the first ten groups of DPN. To
simplify the discussion, we take PASCAL VOC 2012 (VOC12) [7] as an example.
Note that DPN can be easily adapted to any other semantic image segmentation
dataset by modifying its hyper-parameters. VOC12 contains 21 categories and
each image is rescaled to 512×512 in training. Therefore, DPN needs to predict
512×512×21 labels in total, i.e. one label map per category with one value for
each pixel. To this end, we extend VGG16 in two aspects.

In particular, let ai and bi denote the i-th group in Table 1 (a) and (b),
respectively. First, we increase the resolution of VGG16 by removing its max
pooling layers at a8 and a10, because most of the information is lost after
pooling, e.g. a10 reduces the input size by 32 times, i.e. from 224×224 to
7×7. As a result, the smallest size of feature map in DPN is 64×64, keeping
much more information compared with VGG16. Note that the filters of b8 are
initialized as the filters of a9, but the 3×3 receptive field is padded into
5×5 as shown in Fig.2 (a), where the cells in white are the original values of
a9's filter and the cells in gray are zeros. This is done because a8 is not
present in DPN, such that each filter in a9 should be convolved on every other
pixel of a7. To maintain the convolution with one stride, we pad the filters
with zeros. Furthermore, the feature maps in b11 are up-sampled to 512×512 by
bilinear interpolation. Since DPN is trained with label maps of the entire
images, the missing information in the preceding layers of b11 can be
recovered by BP.

Figure 2: (a) and (b) show the padding of the filters. (c) illustrates the
local convolution of b12.

Second, two fully-connected layers at a11 are transformed to two convolutional
layers at b9 and b10, respectively. As shown in Table 1 (a), the first 'fc'
layer learns 7×7×512×4096 parameters, which can be altered to 4096 filters in
b9, each of which is 25×25×512. Since a8 and a10 have been removed, the 7×7
receptive field is padded into 25×25 similarly to the above, as shown in Fig.2
(b). The second 'fc' layer learns a 4096×4096 weight matrix, corresponding to
4096 filters in b10. Each filter is 1×1×4096.
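The zero-padding of filters described above can be sketched as follows. This assumes, based on the description of Fig.2 (a) and (b), that zeros are interleaved between the original filter taps so that a stride-one convolution visits every other pixel (or every fourth pixel when two pooling layers are removed); the function name `dilate_filter` and the `rate` parameter are hypothetical.

```python
def dilate_filter(kernel, rate):
    """Insert (rate - 1) zeros between neighboring taps of a square 2-D kernel.

    A k x k kernel becomes ((k - 1) * rate + 1) x ((k - 1) * rate + 1).
    """
    k = len(kernel)
    size = (k - 1) * rate + 1
    out = [[0.0] * size for _ in range(size)]
    for r in range(k):
        for c in range(k):
            out[r * rate][c * rate] = kernel[r][c]
    return out

k3 = [[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]
k5 = dilate_filter(k3, 2)                               # 3x3 -> 5x5 (one pooling layer removed)
k25 = dilate_filter([[1.0] * 7 for _ in range(7)], 4)   # 7x7 -> 25x25 (two pooling layers removed)
```

With `rate=2` a 3×3 filter grows to 5×5, and with `rate=4` a 7×7 filter grows to 25×25, matching the sizes quoted for b8 and b9, while the number of non-zero weights is unchanged.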

Overall, b11 generates the unary labeling results, producing twenty-one
512×512 feature maps, each of which represents the probabilistic label map of
one category.

3.2. Modeling Smoothness Terms 

The last four layers of DPN, i.e. from b12 to b15, are carefully designed to
smooth the unary labeling results.

• b12. As listed in Table 1 (b), 'lconv' in b12 indicates a locally
convolutional layer, which is widely used in face recognition [33, 35] to
capture different information from different facial positions. Similarly,
distinct spatial positions of b12 have different filters, and each filter is
shared across 21 input channels, as shown in Fig.2 (c). It can be formulated
as

$o_{12}(j,v) = \mathrm{lin}(k_{(j,v)} * o_{11}(j,v))$,    (8)

where $\mathrm{lin}(x) = ax + b$ represents the linear activation function,
$*$ is the convolutional operator, and $k_{(j,v)}$ is a 50×50×1 filter at
position $j$ of channel $v$. We have
$k_{(j,1)} = k_{(j,2)} = ... = k_{(j,21)}$ shared across 21 channels.
$o_{11}(j,v)$ indicates a local patch in b11, while $o_{12}(j,v)$ is the
corresponding output of b12. Since b12 has stride one, the result of
$k_j * o_{11}(j,v)$ is a scalar. In summary, b12 has 512×512 different filters
and produces 21 output feature maps.

Figure 3: (a) and (b) illustrate the convolutions of b13 and the poolings in
b14.

Eqn.(8) implements the triple penalty of Eqn.(7). Recall that each output
feature map of b11 indicates a probabilistic label map of a specific object
appearing in the image. As a result, Eqn.(8) suggests that the probability of
object $v$ being present at position $j$ is updated by weighted averaging over
the probabilities at its nearby positions. Thus, as shown in Fig.1 (c),
$o_{11}(j,v)$ corresponds to a patch of $Q^v$ centered at $j$, which has
values $p_z^v$, $\forall z \in \mathcal{N}_j$ with a 50×50 neighborhood.
Similarly, $k_{(j,v)}$ is initialized by $d(j,z) p_j^v$, implying each filter
captures dissimilarities between positions. These filters remain fixed during
BP, rather than being learned as in a conventional CNN^1.

• b13. As shown in Table 1 (b) and Fig.3 (a), b13 is a convolutional layer
that generates 105 feature maps by using 105 filters of size 9×9×21. For
example, the value of $(i, u=1)$ is attained by applying a 9×9×21 filter at
positions $\{(j, v=1,...,21)\}$. In other words, b13 learns a filter for each
category to penalize the probabilistic label maps of b12, corresponding to the
local label contexts in Eqn.(7) by assuming $K = 5$ and $n = 9$, as shown in
Fig.1 (d).

• b14. As illustrated in Table 1 and Fig.3 (b), b14 is a block min pooling
layer that pools over every 1×1 region with one stride across every 5 input
channels, leading to 21 output channels, i.e. 105÷5=21. b14 activates the
contextual pattern with the smallest penalty.

• b15. This layer combines both the unary and smoothness terms by summing the
outputs of b11 and b14 in an element-wise manner similar to Eqn.(7),

$o_{15}(i,u) = \frac{\exp\{\ln(o_{11}(i,u)) - o_{14}(i,u)\}}{\sum_{u=1}^{21} \exp\{\ln(o_{11}(i,u)) - o_{14}(i,u)\}}$,    (9)

where the probability of assigning label $u$ to pixel $i$ is normalized over
all the labels.

^1 Each filter in b12 actually represents a distance metric between pixels in
a specific region. In VOC12, the patterns of all the training images in a
specific region are heterogeneous, because of various object shapes.
Therefore, we initialize each filter with Euclidean distance. Nevertheless,
Eqn.(8) is a more general form than the triple penalty in Eqn.(7), i.e.
filters in (8) can be automatically learned from data, if the patterns in a
specific region are homogeneous, such as face or human images, which have more
regular shapes than images in VOC12.
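The behavior of b12, b14, and b15 can be sketched on tiny inputs as follows. This is a plain-Python toy, not the GPU implementation used in the paper: the function names, the zero handling at image borders, the odd 3×3 filter size (chosen so that a center tap exists), and the assumption that the five mixture channels of each label are stored contiguously are all illustrative assumptions. b13 is an ordinary convolution and is omitted.

```python
import math

def local_conv(maps, filters):
    """b12, Eqn.(8) without the linear activation.

    maps[v][y][x]          probability of label v at pixel (y, x)
    filters[y][x][dy][dx]  s x s filter owned by position (y, x), shared by all channels
    Neighbors outside the map count as zero (assumed boundary handling).
    """
    h, w = len(maps[0]), len(maps[0][0])
    s = len(filters[0][0])
    r = s // 2
    out = []
    for chan in maps:
        res = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                for dy in range(s):
                    for dx in range(s):
                        yy, xx = y + dy - r, x + dx - r
                        if 0 <= yy < h and 0 <= xx < w:
                            res[y][x] += filters[y][x][dy][dx] * chan[yy][xx]
        out.append(res)
    return out

def block_min_pool(maps, block=5):
    """b14: per-pixel minimum over each run of `block` consecutive channels."""
    h, w = len(maps[0]), len(maps[0][0])
    pooled = []
    for start in range(0, len(maps), block):
        group = maps[start:start + block]
        pooled.append([[min(m[y][x] for m in group) for x in range(w)]
                       for y in range(h)])
    return pooled

def combine_unary_and_smoothness(o11, o14):
    """b15, Eqn.(9) at one pixel: softmax over labels of ln(o11) - o14."""
    logits = [math.log(p) - s for p, s in zip(o11, o14)]
    m = max(logits)  # subtracting the max leaves the softmax unchanged
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

# b12 toy: identity filters everywhere except the center, which sums its 3x3 patch.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
filters = [[identity for _ in range(3)] for _ in range(3)]
filters[1][1] = [[1.0] * 3 for _ in range(3)]
smoothed = local_conv([[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]], filters)

# b14 toy: 105 constant 2x2 maps whose value is the channel index -> 21 maps.
pooled = block_min_pool([[[float(c)] * 2 for _ in range(2)] for c in range(105)])

# b15 toy: with zero smoothness penalties the unary probabilities are returned.
probs = combine_unary_and_smoothness([0.6, 0.3, 0.1], [0.0, 0.0, 0.0])
```

Raising the penalty of one label in `o14` lowers its share of the normalized output, which is how the smoothness terms revise the unary prediction.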

Relation to Previous Deep Models. Many existing deep models such as
[39, 3, 30] employed Eqn.(3) as the pairwise terms, which are special cases of
Eqn.(7). To see this, let $K=1$ and $j=i$; the right hand side of (7) reduces
to

$\exp\{-\Phi_i^u - \lambda_1 \sum_{v \in L} \mu_1(i,u,i,v) \sum_{z \in \mathcal{N}_i} d(i,z) p_i^v p_z^v\}$
$= \exp\{-\Phi_i^u - \sum_{v \in L} \mu(u,v) \sum_{z \in \mathcal{N}_i, z \neq i} d(i,z) p_z^v\}$,    (10)

where $\mu(u,v)$ and $d(i,z)$ represent the global label co-occurrence and
pairwise pixel similarity of Eqn.(3), respectively. This is because
$\lambda_1$ is a constant, $d(i,i) = 0$, and $\mu(i,u,i,v) = \mu(u,v)$.
Eqn.(10) is the corresponding MF update equation of (3).
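For reference, the MF update associated with Eqn.(3) can also be obtained directly from Eqn.(6). The short derivation below is a sketch that uses only the symbols defined above and swaps the order of the two sums:

```latex
\begin{align*}
q_i^u &\propto \exp\Big\{-\Phi_i^u - \sum_{\forall j \in \mathcal{N}_i} \sum_{\forall v \in L} q_j^v \, \Psi_{ij}^{uv}\Big\}
  && \text{by Eqn.(6)} \\
&= \exp\Big\{-\Phi_i^u - \sum_{\forall j \in \mathcal{N}_i} \sum_{\forall v \in L} q_j^v \, \mu(u,v)\, d(i,j)\Big\}
  && \text{by Eqn.(3)} \\
&= \exp\Big\{-\Phi_i^u - \sum_{\forall v \in L} \mu(u,v) \sum_{\forall j \in \mathcal{N}_i} d(i,j)\, q_j^v\Big\}.
\end{align*}
```

Renaming $j$ to $z$ and initializing each $q_z^v$ by $p_z^v$ gives the same form as the second line of Eqn.(10); whether $z = i$ is included in the sum makes no difference because $d(i,i) = 0$.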

3.3. Learning Algorithms 

Learning. The first ten groups of DPN are initialized by VGG16^2, while the
last four groups can be initialized randomly. DPN is then fine-tuned in an
incremental manner with four stages. During fine-tuning, all these stages
minimize the pixelwise softmax loss [22], but update different sets of
parameters.

First, we add a loss function to b11 and fine-tune the weights from b1 to b11
without the last four groups, in order to learn the unary terms. Second, to
learn the triple relations, we stack b12 on top of b11 and update its
parameters (i.e. $\omega_1, \omega_2$ in the distance measure), while the
weights of the preceding groups (i.e. b1∼b11) are fixed. Third, b13 and b14
are stacked onto b12 and, similarly, their weights are updated with all the
preceding parameters fixed, so as to learn the local label contexts. Finally,
all the parameters are jointly fine-tuned.

Implementation. DPN transforms Eqn.(7) into convolutions and poolings in the
groups from b12 to b15, such that filtering at each pixel can be performed in
a parallel manner. Assume we have $f$ input and $f'$ output feature maps,
$N \times N$ pixels, filters with $s \times s$ receptive field, and a
mini-batch with $M$ samples. b12 takes a total of
$f \cdot N^2 \cdot s^2 \cdot M$ operations, b13 takes
$f \cdot f' \cdot N^2 \cdot s^2 \cdot M$ operations, while both b14 and b15
require $f \cdot N^2 \cdot M$ operations. For example, when $M=10$ as in our
experiment, we have 21×512^2×50^2×10 ≈ 1.38×10^11 operations in b12, which has
the highest complexity in DPN. We parallelize these operations using matrix
multiplication on GPU as [4] did, so that b12 can be computed within 30ms. The
total runtime of the last four layers of DPN is 75ms. Note that convolutions
in DPN can be further sped up by low-rank decompositions [14] of the filters
and model compression [13].

^2 We use the released VGG16 model, which is publicly available at
http://www.robots.ox.ac.uk/~vgg/research/very_deep/

In contrast, direct calculation of Eqn.(7) is accelerated by fast Gaussian
filtering [1]. For a mini-batch of ten 512×512 images, a recently optimized
implementation [16] takes 12 seconds on CPU to compute one iteration of (7).
Therefore, DPN makes (7) easier to parallelize and speed up.
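The operation count quoted for b12 can be checked with a few lines of arithmetic. The helper name `local_conv_ops` is hypothetical; the formula and the numbers are those given in the Implementation paragraph.

```python
def local_conv_ops(f, n, s, m):
    """Operation count f * N^2 * s^2 * M for a layer such as b12.

    f: number of feature maps, n: image side length in pixels,
    s: filter side length, m: mini-batch size.
    """
    return f * n ** 2 * s ** 2 * m

# 21 label maps, 512x512 pixels, 50x50 filters, mini-batch of 10 images.
b12_ops = local_conv_ops(f=21, n=512, s=50, m=10)
print(f"{b12_ops:.3e}")
```

The exact count is 137,625,600,000, i.e. on the order of 10^11 operations per mini-batch.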

4. Experiments 

Dataset. We evaluate the proposed approach on the PASCAL VOC 2012 (VOC12) [7]
dataset, which contains 20 object categories and one background category.
Following previous works such as [12, 22, 3], we employ 10,582 images for
training, 1,449 images for validation, and 1,456 images for testing.

Evaluation Metrics. All existing works employed mean pixelwise
intersection-over-union (denoted as mIoU) [22] to evaluate their performance.
To fully examine the effectiveness of DPN, we introduce another three metrics,
including tagging accuracy (TA), localization accuracy (LA), and boundary
accuracy (BA). (1) TA compares the predicted image-level tags with the ground
truth tags, calculating the accuracy of multi-class image classification. (2)
LA evaluates the IoU between the predicted object bounding boxes^3 and the
ground truth bounding boxes (denoted as bIoU), measuring the precision of
object localization. (3) BA: for those objects that have been correctly
localized, we compare the predicted object boundary with the ground truth
boundary, measuring the precision of the semantic boundary similar to [12].
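As an illustration of the mIoU metric, the sketch below computes the mean per-class intersection-over-union on a flat list of pixel labels. It is a simplified toy: the function name is hypothetical, classes absent from both prediction and ground truth are skipped (an assumed convention), and no claim is made that it reproduces the official VOC evaluation protocol.

```python
def mean_iou(pred, truth, num_classes):
    """Mean over classes of |pred AND truth| / |pred OR truth|.

    pred, truth: flat sequences of per-pixel class indices of equal length.
    Classes that appear in neither sequence are skipped (assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Four pixels, two classes: class 0 has IoU 1/2 and class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Here a single mislabeled pixel lowers the IoU of both classes involved, which is why mIoU is stricter than plain pixel accuracy.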

Comparisons. DPN is compared with the best-performing methods on VOC12,
including FCN [22], Zoom-out [25], DeepLab [3], WSSL [28], BoxSup [5],
Piecewise [19], and RNN [39]. All these methods are based on CNNs and MRFs,
and trained on VOC12 data following [22]. They can be grouped according to
different aspects: (1) joint-train: Piecewise and RNN; (2) w/o joint-train:
DeepLab, WSSL, FCN, and BoxSup; (3) pre-train on COCO: RNN, WSSL, and BoxSup.
The first and the second groups are the methods with and without joint
training of CNNs and MRFs, respectively. Methods in the last group also
employed MS-COCO [20] to pre-train deep models. To conduct a comprehensive
comparison, the performance of DPN is reported on both settings, i.e., with
and without pre-training on COCO.

In the following, Sec.4.1 investigates the effectiveness of different
components of DPN on the VOC12 validation set. Sec.4.2 compares DPN with the
state-of-the-art methods on the VOC12 test set.

^3 They are the bounding boxes of the predicted segmentation regions.

(a) Comparisons between different receptive fields of b12.
Receptive Field | baseline | 10×10 | 50×50 | 100×100
mIoU (%)        | 63.4     | 63.8  | 64.7  | 64.3

(b) Comparisons between different receptive fields of b13.
Receptive Field | 1×1  | 5×5  | 9×9  | 9×9 mixtures
mIoU (%)        | 64.8 | 66.0 | 66.3 | 66.5

(c) Comparing pairwise terms of different methods.
Pairwise Terms  | DSN [30] | DeepLab [3] | DPN
improvement (%) | 2.6      | 3.3         | 5.4

Table 2: Ablation study of hyper-parameters.

4.1. Effectiveness of DPN 

All the models evaluated in this section are trained and tested on VOC12.

Triple Penalty. The receptive field of b12 indicates the range of triple
relations for each pixel. We examine different settings of the receptive
fields, including '10×10', '50×50', and '100×100', as shown in Table 2 (a),
where '50×50' achieves the best mIoU, which is slightly better than '100×100'.
For a 512×512 image, this result implies that a 50×50 neighborhood is
sufficient to capture relations between pixels, while smaller or larger
regions tend to under-fit or over-fit the training data. Moreover, all models
of triple relations outperform the 'baseline' method that models dense
pairwise relations, i.e. VGG16+denseCRF [16].

Label Contexts. The receptive field of b13 indicates the range of local label
context. To evaluate its effectiveness, we fix the receptive field of b12 as
50×50. As summarized in Table 2 (b), '9×9 mixtures' improves the preceding
settings by 1.7, 0.5, and 0.2 percent, respectively. We observe that a large
gap exists between '1×1' and '5×5'. Note that the 1×1 receptive field of b13
corresponds to learning a global label co-occurrence without considering local
spatial contexts. Table 2 (c) shows that the pairwise terms of DPN are more
effective than those of DSN and DeepLab^4.

More importantly, the mIoU of all the categories can be improved when
increasing the size of the receptive field and learning a mixture.
Specifically, for each category, the improvements of the last three settings
in Table 2 (b) over the first one are 1.2±0.2, 1.5±0.2, and 1.7±0.3,
respectively.

We also visualize the learned label compatibilities and contexts in Fig.4 (a)
and (b), respectively. (a) is obtained by summing each filter in b13 over the
9×9 region, indicating how likely a column object is to be present when a row
object is present. Blue represents high possibility. (a) is non-symmetric. For
example, when 'horse' is present, 'person' is more likely to be present than
the other objects. Also, 'chair' is compatible with 'table' and 'bkg' is
compatible with all the objects. (b) visualizes some contextual patterns,
where 'A:B' indicates where 'B' is more likely to be present when 'A' is
present. For example, 'bkg' is around 'train', 'motorbike' is below 'person',
and 'person' is sitting on 'chair'.

^4 The other deep models such as RNN and Piecewise did not report the exact
improvements after combining unary and pairwise terms.

Figure 4: Visualization of (a) learned label compatibility (b) learned
contextual information. (Best viewed in color)

Figure 5: Step-by-step visualization of DPN: (a) Original Image, (b) Ground
Truth, (c) Unary Term, (d) +Triple Penalty, (e) +Label Contexts, (f) +Joint
Tuning. (Best viewed in color)

Figure 6: Ablation study of (a) training strategy (b) required MF iterations.
(Best viewed in color)

Incremental Learning. As discussed in Sec.3.3, DPN is trained in an
incremental manner. The right hand side of Table 3 (a) demonstrates that each
stage leads to a performance gain compared to its previous stage. For
instance, 'triple penalty' improves 'unary term' by 2.3 percent, while 'label
contexts' improves 'triple penalty' by 1.8 percent. More importantly, jointly
fine-tuning all the components (i.e. unary terms and pairwise terms) in DPN
achieves another gain of 1.3 percent. A step-by-step visualization is provided
in Fig.5.

We also compare 'incremental learning' with 'joint learning', which fine-tunes
all the components of DPN at the same time. Their training curves are plotted
in Fig.6 (a), showing that the former leads to higher and more stable
accuracies across iterations, while the latter may get stuck at local minima.
This difference is easy to understand, because incremental learning introduces
new parameters only after all existing parameters have been fine-tuned.

Figure 7: Stage-wise analysis of (a) mean tagging accuracy (b) mean
localization accuracy (c) mean boundary accuracy. Legend: Unary Term, Triple
Penalty, Label Contexts, Joint Tuning.

One-iteration MF DPN approximates one iteration of
MF. Fig. 6 (b) illustrates that DPN reaches a good accuracy
with one MF iteration. In contrast, a CRF [16] with dense pairwise
edges needs more than 5 iterations to converge, and it still
shows a large accuracy gap compared to DPN. Note that the existing
deep models such as [3, 39, 30] also required 5 to 10 iterations to
converge.
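
To make the cost of one MF iteration concrete, the following is a minimal sketch of a single mean field update for a generic pairwise MRF on a 4-connected grid. It is not the DPN layers themselves: the grid neighborhood, the Potts cost `potts`, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax_neg(energy):
    """Turn energies of shape (H, W, L) into probabilities via softmax(-energy)."""
    z = -energy
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neighbor_sum(q):
    """Sum q over the 4-connected neighbors of every pixel (zero padding)."""
    s = np.zeros_like(q)
    s[1:, :] += q[:-1, :]   # neighbor above
    s[:-1, :] += q[1:, :]   # neighbor below
    s[:, 1:] += q[:, :-1]   # neighbor to the left
    s[:, :-1] += q[:, 1:]   # neighbor to the right
    return s

def mean_field_step(unary, q, compat):
    """One mean field update.

    unary:  (H, W, L) unary energies (lower is better)
    q:      (H, W, L) current label marginals
    compat: (L, L) cost compat[u, v] of labels u and v on neighboring pixels
    """
    # Expected pairwise cost of label u: sum over neighbors j and labels v
    # of compat[u, v] * q_j(v).
    pairwise = neighbor_sum(q) @ compat.T
    return softmax_neg(unary + pairwise)

rng = np.random.default_rng(0)
H, W, L = 4, 5, 3
unary = rng.normal(size=(H, W, L))
q0 = softmax_neg(unary)       # initialize from the unary term alone
potts = 1.0 - np.eye(L)       # hypothetical Potts cost: penalize differing labels
q1 = mean_field_step(unary, q0, potts)
```

An iterative solver would call `mean_field_step` repeatedly until the marginals stop changing, so a model that needs only one such step saves the corresponding multiple of this computation.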

Different Components Modeling Different Information
We further evaluate DPN using three metrics. The
results are given in Fig. 7. For example, (a) illustrates that
the tagging accuracy (TA) can be improved in the third stage, as
it captures label co-occurrence with a mixture of contextual
patterns. However, TA decreases slightly after the final
stage: since joint tuning maximizes segmentation accuracy
by optimizing all components together, extremely
small objects, which rarely occur in the VOC training set,
are discarded. As shown in (b), the accuracy of object
localization is significantly improved in the second and the
final stages. This is intuitive because the unary prediction
can be refined by long-range and high-order pixel relations,
and joint training further improves the results. (c) discloses
that the second stage also captures object boundaries, since
it measures dissimilarities between pixels.
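
As a rough illustration of what the first two metrics measure, the sketch below computes an image-level tagging accuracy and a bounding-box IoU from a pair of label maps. The exact definitions here are assumptions (presence or absence of each object class for TA, tight boxes around one class for LA), and all function names are hypothetical.

```python
import numpy as np

def image_tags(label_map, background=0):
    """Set of object classes that occur in a label map."""
    return set(np.unique(label_map).tolist()) - {background}

def tagging_accuracy(pred, gt, num_classes=21, background=0):
    """Fraction of object classes whose presence or absence is predicted correctly."""
    pred_tags = image_tags(pred, background)
    gt_tags = image_tags(gt, background)
    classes = [c for c in range(num_classes) if c != background]
    correct = sum((c in pred_tags) == (c in gt_tags) for c in classes)
    return correct / len(classes)

def bounding_box(mask):
    """Tight inclusive box (top, left, bottom, right) around a non-empty boolean mask."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

def box_area(box):
    """Number of pixels covered by an inclusive box."""
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

def box_iou(a, b):
    """IoU of two inclusive pixel boxes."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, bottom - top + 1) * max(0, right - left + 1)
    return inter / (box_area(a) + box_area(b) - inter)

# Toy example: one object of an arbitrary class (15), predicted one row too low.
gt = np.zeros((6, 6), dtype=np.int64)
gt[1:4, 1:4] = 15
pred = np.zeros((6, 6), dtype=np.int64)
pred[2:5, 1:4] = 15

ta = tagging_accuracy(pred, gt)
biou = box_iou(bounding_box(pred == 15), bounding_box(gt == 15))
```

In this toy case the tag is correct while the box overlaps only partially, which is the kind of separation between tagging and localization errors that the stage-wise analysis relies on.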

Per-class Analysis Table 3 (a) reports the per-class
accuracies of four evaluation metrics, where the first four
rows represent the mIoU of the four stages, while the last
three rows represent TA, LA, and BA, respectively. We
have several valuable observations, which motivate future
research. (1) Joint training benefits most of the categories,
except animals such as ‘bird’, ‘cat’, and ‘cow’. Some
instances of these categories are extremely small so that












































                       areo   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv   Avg.
Unary Term (mIoU)      77.5   34.1   76.2   58.3   63.3   78.1   72.5   76.5   26.6   59.9   40.8   70.0   62.9   69.3   76.3   39.2   70.4   37.6   72.5   57.3   62.4
+ Triple Penalty       82.3   35.9   80.6   60.1   64.8   79.5   74.1   80.9   27.9   63.5   40.4   73.8   66.7   70.8   79.0   42.0   74.1   39.1   73.2   58.5   64.7
+ Label Contexts       83.2   35.6   82.6   61.6   65.5   80.5   74.3   82.6   29.9   67.9   47.5   75.2   70.3   71.4   79.6   42.7   77.8   40.6   75.3   59.1   66.5
+ Joint Tuning         84.8   37.5   80.7   66.3   67.5   84.2   76.4   81.5   33.8   65.8   50.4   76.8   67.1   74.9   81.1   48.3   75.9   41.8   76.6   60.4   67.8
TA (tagging Acc.)      98.8   97.9   98.4   97.7   96.1   98.6   95.2   96.8   90.1   97.5   95.7   96.7   96.3   98.1   93.3   96.1   98.7   92.2   97.4   96.3   96.4
LA (bIoU)              81.7   76.3   75.5   70.3   54.4   86.4   70.6   85.6   51.8   79.6   57.1   83.3   79.2   80.0   74.1   53.1   79.1   68.4   76.3   58.8   72.1
BA (boundary Acc.)     95.9   83.9   96.9   92.6   93.8   94.0   95.7   95.6   89.5   93.3   91.4   95.2   94.2   92.7   94.5   90.4   94.8   90.5   93.7   96.6   93.3

(a) Per-class results on VOC12 val.



                       areo   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv   mIoU
FCN [22]               76.8   34.2   68.9   49.4   60.3   75.3   74.7   77.6   21.4   62.5   46.8   71.8   63.9   76.5   73.9   45.2   72.4   37.4   70.9   55.1   62.2
Zoom-out [25]          85.6   37.3   83.2   62.5   66.0   85.1   80.7   84.9   27.2   73.2   57.5   78.1   79.2   81.1   77.1   53.6   74.0   49.2   71.7   63.3   69.6
Piecewise [19]         87.5   37.7   75.8   57.4   72.3   88.4   82.6   80.0   33.4   71.5   55.0   79.3   78.4   81.3   82.7   56.1   79.8   48.6   77.1   66.3   70.7
DeepLab [3]            84.4   54.5   81.5   63.6   65.9   85.1   79.1   83.4   30.7   74.1   59.8   79.0   76.1   83.2   80.8   59.7   82.2   50.4   73.1   63.7   71.6
RNN [39]               87.5   39.0   79.7   64.2   68.3   87.6   80.8   84.4   30.4   78.2   60.4   80.5   77.8   83.1   80.6   59.5   82.8   47.8   78.3   67.1   72.0
WSSL† [28]             89.2   46.7   88.5   63.5   68.4   87.0   81.2   86.3   32.6   80.7   62.4   81.0   81.3   84.3   82.1   56.2   84.6   58.3   76.2   67.2   73.9
RNN† [39]              90.4   55.3   88.7   68.4   69.8   88.3   82.4   85.1   32.6   78.5   64.4   79.6   81.9   86.4   81.8   58.6   82.4   53.5   77.4   70.1   74.7
BoxSup† [5]            89.8   38.0   89.2   68.9   68.0   89.6   83.0   87.7   34.4   83.6   67.1   81.5   83.7   85.2   83.5   58.6   84.9   55.8   81.2   70.7   75.2
DPN                    87.7   59.4   78.4   64.9   70.3   89.3   83.5   86.1   31.7   79.9   62.6   81.9   80.0   83.5   82.3   60.5   83.2   53.4   77.9   65.0   74.1
DPN†                   89.0   61.6   87.7   66.8   74.7   91.2   84.3   87.6   36.5   86.3   66.1   84.4   87.8   85.6   85.4   63.6   87.3   61.3   79.4   66.4   77.5

(b) Per-class results on VOC12 test. The approaches pre-trained on COCO [20] are marked with ‘†’.

Table 3: Per-class results on VOC12.


joint training discards them for smoother results. (2)
Training DPN with pixelwise label maps implicitly models
image-level tags, since it achieves a high average TA of
96.4%. (3) Object localization always helps. However,
for an object with a complex boundary such as ‘bike’, the
mIoU is low even if it can be localized, e.g. ‘bike’ has
high LA but low BA and mIoU. (4) Failures of different
categories are caused by different factors, which can be
easily identified with these three metrics. For example, the
failures of ‘chair’, ‘table’, and ‘plant’ are caused by the difficulty
of accurately capturing their bounding boxes and boundaries.
Although ‘bottle’ and ‘tv’ are also difficult to localize, they
achieve moderate mIoU because of their regular shapes. In
other words, the mIoU of ‘bottle’ and ‘tv’ can be significantly
improved if they can be accurately localized.
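
For reference, per-class IoU and its mean (mIoU) are commonly computed from a confusion matrix between predicted and ground-truth label maps. The sketch below follows that standard recipe; the ignore label 255 and the function name `per_class_iou` are assumptions, and this is not claimed to be the evaluation code used for Table 3.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes, ignore_label=255):
    """Per-class IoU from integer label maps via a confusion matrix."""
    valid = gt != ignore_label  # skip pixels marked as ambiguous
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)  # rows: ground truth, cols: prediction
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    # Classes absent from both maps get NaN so they do not affect the mean.
    return np.where(union > 0, inter / np.maximum(union, 1), np.nan)

# Toy example with two classes on a 2 x 3 image.
gt = np.array([[0, 0, 1],
               [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 1]])

iou = per_class_iou(pred, gt, num_classes=2)
miou = float(np.nanmean(iou))
```

Here class 0 has intersection 2 and union 3, and class 1 has intersection 3 and union 4, so a single mislabeled pixel lowers both per-class scores.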

4.2. Overall Performance 

As shown in Table 3 (b), we compare DPN with the
best-performing methods on the VOC12 test set under two
settings, i.e. with and without pre-training on COCO. The
results of these methods were presented in either the published
papers or arXiv pre-prints. The approaches pre-trained on COCO
are marked with ‘†’. We evaluate DPN on several scales of the
images and then average the results following [3, 19].

DPN outperforms all the existing methods that were
trained on VOC12, yet it needs only one MF iteration
to solve the MRF, rather than the 10 iterations of RNN, DeepLab,
and Piecewise. By averaging the results of two DPNs, we
achieve 74.1% accuracy on VOC12 without outside training
data. As discussed in Sec. 3.3, the MF iteration is the most
complex step even when it is implemented as convolutions.
Therefore, DPN reduces runtime by at least 10× compared to
previous works.
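
Averaging the results of several scales or several models can be sketched as below. This is only one plausible reading: it assumes that each prediction is a class-probability map already resized to a common resolution and that the maps are averaged before taking the arg max; the function name `average_predictions` is hypothetical.

```python
import numpy as np

def average_predictions(prob_maps):
    """Average class-probability maps of shape (H, W, L) and pick the best label."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_prob, mean_prob.argmax(axis=-1)

# Two toy predictions for a 1 x 2 image with two classes.
a = np.array([[[0.9, 0.1], [0.4, 0.6]]])
b = np.array([[[0.7, 0.3], [0.8, 0.2]]])

mean_prob, labels = average_predictions([a, b])
```

In the toy example the second pixel is labeled differently by the two predictions, and the average resolves it in favor of the more confident one.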

Following [39, 5], we pre-train DPN with COCO, where the
20 object categories that are also present in VOC12 are
selected for training. A single DPN† achieves 77.5%
mIoU on the VOC12 test set. As shown in Table 3 (b),
DPN† achieves the best performance on more
than half of the object classes. Please refer to the appendices
for visual quality comparisons.

5. Conclusion 

We proposed Deep Parsing Network (DPN) to address
semantic image segmentation, which has several appealing
properties. First, DPN unifies the inference and learning
of unary and pairwise terms in a single convolutional
network. No iterative inference is required during
back-propagation. Second, high-order relations and mixtures
of label contexts are incorporated into its pairwise term
modeling, making existing works serve as special cases.
Third, DPN is built upon conventional CNN operations,
and is thus easy to parallelize and speed up.

DPN achieves state-of-the-art performance on VOC12,
and multiple valuable facts about semantic image segmentation
are revealed through extensive experiments. Future
directions include investigating the generalizability of DPN
to more challenging scenarios, e.g. a large number of object
classes and substantial appearance/scale variations.
Appendices


A. Fast Implementation of Local Convolution

b12 in DPN is a locally convolutional layer. As mentioned
in Eqn. (3), the local filters in b12 are computed
from the distances between the RGB values of the pixels. XY
coordinates are omitted here because they can be
pre-computed. To accelerate the computation of local
convolution, a lookup table-based filtering approach is employed.
We first construct a lookup table storing the distances between
any two pixel intensities (ranging from 0 to 255), which
results in a 256 × 256 matrix. Then, when we perform local
convolution, the kernel coefficients can be obtained
efficiently by just looking up the table.
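
The lookup-table idea can be sketched as follows. The squared intensity difference used to fill the table is an assumption chosen only for illustration (the actual distance is defined by Eqn. (3)), and `LUT` and `rgb_distance_lut` are hypothetical names.

```python
import numpy as np

# 256 x 256 table: LUT[a, b] is the distance between intensities a and b.
# Squared difference is an assumed distance, used here only for illustration.
levels = np.arange(256, dtype=np.int64)
LUT = (levels[:, None] - levels[None, :]) ** 2

def rgb_distance_lut(center, neighbors):
    """Distance between one RGB pixel and N neighbors using table lookups.

    center:    (3,) uint8 RGB value
    neighbors: (N, 3) uint8 RGB values
    """
    c = center[None, :].astype(np.int64)
    n = neighbors.astype(np.int64)
    # One lookup per channel, summed over the three channels.
    return LUT[c, n].sum(axis=1)

center = np.array([10, 20, 30], dtype=np.uint8)
neighbors = np.array([[10, 20, 30],
                      [12, 20, 29],
                      [0, 0, 0]], dtype=np.uint8)

dist = rgb_distance_lut(center, neighbors)

# Direct computation without the table, for comparison.
direct = ((center.astype(np.int64) - neighbors.astype(np.int64)) ** 2).sum(axis=1)
```

The table is built once and shared by all pixels, so each kernel coefficient costs an indexed read per channel instead of re-evaluating the distance function.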

B. Visual Quality Comparisons 

In the following, we inspect the visual quality of the obtained
label maps. Fig. 8 compares DPN
with FCN [22] and DeepLab [3]. We use the publicly
released model (http://dl.caffe.berkeleyvision.org/fcn-8s-pascal.caffemodel)
to re-generate the label maps of FCN, while
the results of DeepLab are extracted from their published
paper. DPN generally makes more accurate predictions at
both the image level and the instance level.

We also include more examples of DPN label maps in
Fig. 9. We observe that learning local label contexts helps
differentiate confusing objects, and learning the triple penalty
facilitates the capturing of intrinsic object boundaries.




Figure 8: Visual quality comparison of different semantic image segmentation methods: (a) input image (b) ground truth (c) 
FCN [22] (d) DeepLab [3] and (e) DPN. 




Figure 9: Visual quality of DPN label maps: (a) input image (b) ground truth (white labels indicating ambiguous regions) 
and (c) DPN. 

































































